TH3: Prototyping Modules for Geospatial Analytics Shiny Application

Take Home Exercise 3

Author

Kendrick Teo

Published

October 14, 2024

Modified

November 1, 2024

TH3.1 Introduction

The objective of Take-Home Exercise 3 is to prototype a module to be built for the Geospatial Analytics project.

By using hot and cold spot analysis and building a hedonic pricing model, the aim of the project is to analyse the trajectory and factors influencing housing prices in Johor Bahru. As the group member in charge of exploratory data analysis, this take-home exercise serves to prototype the module for performing hot and cold spot analysis.

pacman::p_load(olsrr, ggstatsplot, sf, tmap, tidyverse, gtsummary, performance, see, sfdep, tidygeocoder)

TH3.2 Loading data

Today’s exercise will involve going through the entire data acquisition process from start to finish. This includes sourcing and cleaning the data before it is used for analysis.

TH3.2.1 Loading geospatial data

We will be using Malaysia’s municipality polygons, sourced from the United Nations Humanitarian Data Exchange.

adm3_my <- st_read(dsn = 'shiny/data/geospatial/myadm3', layer = 'geoBoundaries-MYS-ADM3') %>% 
  st_transform(4390)
Reading layer `geoBoundaries-MYS-ADM3' from data source 
  `/Users/kendricktty/Gits/smu_cs/is415-site/TakeHome/TakeHome3/shiny/data/geospatial/myadm3' 
  using driver `ESRI Shapefile'
Simple feature collection with 1859 features and 5 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: 98.93646 ymin: 0.8538234 xmax: 115.6782 ymax: 6.726112
Geodetic CRS:  WGS 84
qtm(adm3_my)

The overall shape of the plot is as above, corresponding to Peninsula (West) Malaysia, Sabah and Sarawak.

TH3.2.2 Loading property transaction data

property_transaction_data <- read_csv('shiny/data/aspatial/Open Transaction Data.csv')
New names:
Rows: 20517 Columns: 14
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(12): Property Type, District, Mukim, Scheme Name/Area, Road Name, Month... num
(1): Land/Parcel Area lgl (1): ...14
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `Unit` -> `Unit...9`
• `Unit` -> `Unit...11`
• `` -> `...14`
head(property_transaction_data)
# A tibble: 6 × 14
  `Property Type`                District   Mukim `Scheme Name/Area` `Road Name`
  <chr>                          <chr>      <chr> <chr>              <chr>      
1 1 - 1 1/2 Storey Semi-Detached Johor Bah… Band… KAW ABDUL SAMAD/M… JLN ISMAIL 
2 1 - 1 1/2 Storey Semi-Detached Johor Bah… Band… KAW ABDUL SAMAD/M… JLN ABDUL …
3 1 - 1 1/2 Storey Semi-Detached Johor Bah… Band… KAW ABDUL SAMAD/M… JLN YUSOF …
4 1 - 1 1/2 Storey Semi-Detached Johor Bah… Band… KIM TENG PARK      JLN BERLIAN
5 1 - 1 1/2 Storey Semi-Detached Johor Bah… Band… KIM TENG PARK      JLN UNGKU …
6 1 - 1 1/2 Storey Semi-Detached Johor Bah… Band… LARKIN GARDEN      JLN DATO J…
# ℹ 9 more variables: `Month, Year of Transaction Date` <chr>, Tenure <chr>,
#   `Land/Parcel Area` <dbl>, Unit...9 <chr>, `Main Floor Area` <chr>,
#   Unit...11 <chr>, `Unit Level` <chr>, `Transaction Price` <chr>, ...14 <lgl>

TH3.3 Data Preprocessing

TH3.3.1 Filtering Mukims outside Johor Bahru

We are only concerned with Mukims within Johor Bahru and Kulai district - the latter of which contains Johor Bahru’s main Senai international airport. Hence, we will need to filter these out.

johor_kulai_mukims <- c("MUKIM JELUTONG", "MUKIM PLENTONG", "MUKIM PULAI", "MUKIM SUNGAI TIRAM", "MUKIM TANJUNG KUPANG", "MUKIM TEBRAU", "BANDAR JOHOR BAHRU", "MUKIM BUKIT BATU", "MUKIM KULAI", "BANDAR KULAI", "MUKIM SEDENAK", "MUKIM SENAI")
adm3_jb_kulai <- adm3_my %>%
  filter(shapeName %in% johor_kulai_mukims)
tmap_mode('view')
tmap mode set to interactive viewing
qtm(adm3_jb_kulai)

As we can see, there are still some stray polygons located very far north from Johor Bahru and Kulai proper that we have to remove. A quick inspection of the dataframe in memory indicates they are associated with the Mukims of Pulai and Jelutong, and carry the IDs 4, 5, 6, 16.

The code chunk below will break up the multipolygons into simple polygons, and remove the strays. The result is the correct, contiguous borders of the Mukims in Johor Bahru and Kulai districts.

broken_up_jb <- adm3_jb_kulai %>%
  st_cast("POLYGON")
Warning in st_cast.sf(., "POLYGON"): repeating attributes for all
sub-geometries for which they may not be constant
stray_polygons <- c(4, 5, 6, 16)
adm3_jb_kulai <- broken_up_jb %>%
  filter(!row_number() %in% stray_polygons)

qtm(adm3_jb_kulai)

TH3.3.2 Geocoding Property Data

The raw dataset encodes information about street names, but we need latitudes and longitudes to plot the locations on a map. Therefore, it would be necessary to geocode the data. A dataset as large as ours took at least 2 hours to geocode, so it would be useful to save it as an RDS for later.

property_transaction_sf <- property_transaction_data %>%
  mutate(Address = if_else(
    !is.na(`Road Name`),
    paste(`Road Name`, District, sep = " "),          # Concatenate "Road Name" and "District" if "Road Name" is not NA
    paste(`Scheme Name/Area`, District, sep = " ")    # Else, concatenate "Scheme Name/Area" and "District"
  )) %>%
  geocode(address = Address, method = "osm", lat = "latitude", long = "longitude")

write_rds(property_transaction_sf, "shiny/data/rds/property_transaction_sf.rds")
property <- read_rds("shiny/data/rds/property_transaction_sf_notnull.rds")
property <- property %>%
  st_as_sf(coords = c("longitude", "latitude"), crs = 4326)
# write_rds(property_transaction_sf_notnull, "shiny/data/rds/property_transaction_sf_notnull.rds")

TH3.3.3 Final Preprocessing

Finally, to make our dataset compatible for machine learning use, we need to encode two categorical variables we want to ingest: the Property Type and tenure.

Since Property Type is multiclass categorical, we need to perform one-hot encoding. The following code chunk performs one-hot encoding on the Property Type feature.

dummies <- model.matrix(~ `Property Type` - 1, data = property) %>% as.data.frame()
colnames(dummies) <- gsub("Property Type", "", colnames(dummies))
property <- property %>%
  bind_cols(dummies) %>%
select(-`Property Type`)

Since Tenure is binary categorical (the property is either freehold or leasehold), we can perform binary encoding on this feature.

property <- property %>%
  mutate(is_leasehold = ifelse(Tenure == "Leasehold", 1, 0)) %>%
  select(-Tenure)

Finally, to help us join the property data with our polygon data later, we need to modify the “Mukim” column in the property data column to match that in the polygon data column.

property <- property %>%
  mutate(Mukim = if_else(
    !startsWith(Mukim, "Bandar"),
    paste("Mukim", Mukim),
    Mukim
  )) %>%
  mutate(Mukim = toupper(Mukim))

We also need to parse the number in the Transaction Price column and convert the price from Malaysian Ringgit to US dollars.

property <- property %>%
  mutate(
    `Transaction Price` = parse_number(`Transaction Price`),
    price_USD = `Transaction Price` * 0.21
  ) %>%
  select(-`Transaction Price`)

The last step is to remove the columns we will not need.

property <- property %>%
  select(-`District`, -`Scheme Name/Area`, -`Road Name`, -`Month, Year of Transaction Date`, -`Unit...9`, -`Unit...11`, -`Unit Level`, -`Address`, -`...14`)

write_rds(property, "shiny/data/rds/property_preprocessed.rds")

To visualise the results, we can now plot our work onto a map.

tm_shape(adm3_jb_kulai)+
  tm_borders() +
tm_shape(property) +  
  tm_dots(col = "price_USD",
          alpha = 0.6,
          style="quantile") +
  tm_view(set.zoom.limits = c(11,14))
tmap_mode('plot')
tmap mode set to plotting